[Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding"#38359

Merged
tlrmchlsmth merged 3 commits into vllm-project:main from elvircrn:revert-mla-zero-init
Apr 1, 2026

Conversation

@elvircrn
Contributor

@elvircrn elvircrn commented Mar 27, 2026

Summary

  • Restores the original torch.empty allocation, removing the overhead of pre-allocated zero-init buffers and the out= workaround in FlashInfer MLA.

Test plan

  • Run GB200 DeepSeek-R1 NVFP4 decode with CUDA graph padding and verify no NaN appears
  • Verify no performance regression from removing the pre-allocated buffer
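
The NaN check in the first test-plan step can be sketched as follows. The helper name and the stand-in tensor are illustrative assumptions, not part of the vLLM test suite; a real run would inspect the actual attention output of a padded CUDA-graph decode step.

```python
import torch

# Hedged sketch of the "verify no NaN" step. `assert_no_nan` is an
# illustrative helper, not vLLM code.
def assert_no_nan(out: torch.Tensor) -> None:
    # Count NaN elements and fail loudly if any are present.
    nan_count = torch.isnan(out).sum().item()
    assert nan_count == 0, f"{nan_count} NaN elements in decode output"

# Stand-in tensor in place of real decode output:
assert_no_nan(torch.zeros(4, 128))
```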

🤖 Generated with Claude Code

…N from CUDA graph padding (vllm-project#37442)"

This reverts commit ef2c4f7.

The zero-init workaround is unnecessary: the NaN was caused by a
different issue (int64 expert IDs in the routing simulator). Reverting
restores the original torch.empty allocation, which avoids the
overhead of pre-allocated zero-init buffers.

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@elvircrn elvircrn requested a review from pavanimajety as a code owner March 27, 2026 12:18

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added nvidia v1 bug Something isn't working labels Mar 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes pre-allocated output buffers and simplifies tensor allocation logic across the CUTLASS and FlashInfer MLA backends. In cutlass_mla.py, the _decode_out buffer is replaced with a direct new_empty allocation, while in flashinfer_mla.py, the manual buffer management and padding zeroing workarounds in forward_mqa are removed. I have no feedback to provide.
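
The allocation change described above can be sketched as follows. The function names are assumptions for illustration, not the actual vLLM identifiers in cutlass_mla.py or flashinfer_mla.py.

```python
import torch

# Sketch of the two allocation strategies discussed in this PR.

def alloc_zero_init(q: torch.Tensor) -> torch.Tensor:
    # Pre-revert behavior: a zero-initialized buffer, so CUDA-graph
    # padding rows never contain garbage, at the cost of a memset.
    return torch.zeros_like(q)

def alloc_empty(q: torch.Tensor) -> torch.Tensor:
    # Post-revert behavior: uninitialized memory via new_empty; no
    # memset overhead, and padded rows are simply never read.
    return q.new_empty(q.shape)

q = torch.randn(4, 64)
assert alloc_zero_init(q).abs().sum().item() == 0.0  # guaranteed zeros
assert alloc_empty(q).shape == q.shape               # contents undefined
```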

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2026
Member

@tlrmchlsmth tlrmchlsmth left a comment


This helped but did not address the core issue. Related: #38148 has a real NaN fix but is insufficient.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Apr 1, 2026
@tlrmchlsmth tlrmchlsmth merged commit 5e30e9b into vllm-project:main Apr 1, 2026
60 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Apr 1, 2026
yzong-rh pushed a commit to yzong-rh/vllm that referenced this pull request Apr 3, 2026
…N from CUDA graph padding" (vllm-project#38359)

Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

Labels

bug (Something isn't working), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done


3 participants